Abstract

In this paper we propose and study a generalization of the standard active-learning model where a 
more general type of query, class conditional query, is allowed. Such queries have been quite useful 
in applications, but have been lacking theoretical understanding. In this work, we characterize the 
power of such queries under two well-known noise models. We give nearly tight upper and lower 
bounds on the number of queries needed to learn both for the general agnostic setting and for the 
bounded noise model. We further show that our methods can be made adaptive to the (unknown) 
noise rate, with only negligible loss in query complexity. 

1. Introduction 

The ever-expanding range of application areas for machine learning, together with huge increases in 
the volume of raw data available, has encouraged researchers to look beyond the classic paradigm 
of passive learning from labeled data only. Perhaps the most extensively used and studied technique 
in this context is Active Learning, where the algorithm is presented with a large pool of unlabeled 
examples (such as all images available on the web) and can interactively ask for the labels of exam- 
ples of its own choosing from the pool. The aim is to use this interaction to drastically reduce the 
number of labels needed (which are often the most expensive part of the data collection process) in 
order to reach a low-error hypothesis. 

Over the past ten years there has been a great deal of progress on understanding active learning
and its underlying principles (Balcan, Beygelzimer, and Langford, 2006; Balcan, Broder, and
Zhang, 2007; Beygelzimer, Dasgupta, and Langford, 2009; Castro and Nowak, 2008; Dasgupta,
Hsu, and Monteleoni, 2007; Hanneke, 2007a; Balcan, Hanneke, and Wortman, 2008; Hanneke,
2009; Koltchinskii, 2010; Wang, 2009; Beygelzimer, Hsu, Langford, and Zhang, 2010). However,
while useful in many applications (McCallum and Nigam, 1998; Tong and Koller, 2001), requesting
the labels of select examples is only one very specific type of interaction between the learning
algorithm and the labeler. When analyzing many real world situations, it is desirable to consider 
learning algorithms that make use of other types of queries as well. For example, suppose we are 
actively learning a multiclass image classifier from examples. If at some point, the algorithm needs 
an image from one of the classes, say an example of "house", then an algorithm that can only make 
individual label requests may need to ask the expert to label a large number of unlabeled examples 
before it finally finds an example of a house for the expert to label as such. This problem could be 
averted by simply allowing the algorithm to display a list of around a hundred thumbnail images 
on the screen, and ask the expert to point to an image of a house if there is one. The expert can 
visually scan through those images looking for a house much more quickly than she can label every
one of them. So in this case, we get a significant increase in power by being able to ask a particular
type of query. In fact, queries of this type have been quite useful in several applications (Chang,
Tong, Goh, and Chang, 2005; Doyle, Monaco, Feldman, Tomaszewski, and Madabhushi, 2009), but
unfortunately, they have been lacking a principled theoretical understanding.

In this work we expand the study of active learning by considering a model that allows us to 
analyze queries motivated by such applications. Specifically, the query protocol we analyze, namely 
class-conditional queries, is based on the ability to ask for an example of a given label within a 
given set of unlabeled examples. That is, the algorithm is provided with a large pool of unlabeled 
examples, and may interact with an oracle as follows. In each query, the algorithm proposes a label 
and a subset of the unlabeled examples, and asks the oracle to point to one of these examples whose 
true label agrees with the specified label, if any exist. This is a strict generalization of the traditional 
model of active learning by label requests. 

It is well known that if the target function resides in a known concept class and there is no
classification noise (the so-called realizable case), then a simple approach based on the Halving
algorithm (Littlestone, 1988) can learn a function $\epsilon$-close to the target function using a number of
queries dramatically smaller than the number of random labeled examples required for PAC learning
(Hanneke, 2009).

Encouraged by such strong results for the realizable case, we may wonder whether equally 
strong reductions in query complexity are feasible in the presence of classification noise. In the 
present work, we find that in the general agnostic case, this is not true when the noise rate is large, 
though a different type of reduction is consistently possible: namely, reduction by a factor related to 
the overall noisiness of the data. While this reduction is much more modest than those achievable 
in the realizable case, the fact that it is consistently available is interesting, in that it contrasts 
with active learning, where the known improvements over passive learning vary depending on the
structure of the concept space (Hanneke, 2007a,b; Dasgupta, Hsu, and Monteleoni, 2007). We also
prove a sometimes stronger result in the special case of bounded noise: namely, that compared to
active learning, the query complexity with class-conditional queries is reduced by a factor related
to the noise bound.

Our Results We provide the first general results concerning the query complexity of class-conditional 
queries in the presence of noise in a multiclass setting. In particular: 

1. In the purely agnostic case with noise rate $\eta$, we show that any interactive learning algorithm in
this model seeking a classifier of error at most $\eta + \epsilon$ must make $\Omega(d\eta^2/\epsilon^2)$ queries, where $d$ is
the Natarajan dimension; we also provide a nearly matching upper bound of $\tilde{O}(d\eta^2/\epsilon^2)$, for a
constant number of classes. This is smaller by a factor of $\eta$ compared to the sample complexity
of passive learning, and represents a reduction over the known results for the query complexity 
of active learning in many cases. 

2. In the bounded noise model, we provide nearly tight upper and lower bounds on the query 
complexity of the general query model as a function of the query complexity of active learning. 
In particular, we find that the query complexity of the general query model is essentially reduced 
by a factor of the noise bound, compared to active learning. 

3. We further show that our methods can be made adaptive to the (unknown) noise rate $\eta$, with
only negligible loss in query complexity.

Overall, we find that the reductions in query complexity for this model, compared to the
traditional active learning model, are largely concerned with a factor relating to the noise rate of the
learning problem, so that the closer to the realizable case we are, the greater the potential gains in
query complexity. However, for larger noise rates, the benefits are more modest, a fact that sharply
contrasts with the enormous benefits of using these types of queries in the realizable case; this is
true even for very benign types of noise, such as bounded noise, a fact that may seem surprising,
especially since the query complexity of the traditional active learning model is essentially unchanged
(up to constant and log factors) by the presence of bounded noise, compared to the realizable case
(Kääriäinen, 2006). We hope our analysis will help inform the use of these queries in practical
learning problems, as well as provide a point of reference for future exploration of the general topic
of interactive machine learning.

2. Formal Setting 

We consider an interactive learning setting defined as follows. There is an instance space $\mathcal{X}$, a label
space $\mathcal{Y}$, and some fixed target distribution $\mathcal{D}_{XY}$ over $\mathcal{X} \times \mathcal{Y}$, with marginal $\mathcal{D}_X$ over $\mathcal{X}$. Focusing
on multiclass classification, we assume that $\mathcal{Y} = \{1, 2, \ldots, k\}$, for some $k \in \mathbb{N}$. In the learning
problem, there is an i.i.d. sequence of random variables $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots$, each with
distribution $\mathcal{D}_{XY}$. The learning algorithm is permitted direct access to the sequence of $x_i$ values
(unlabeled data points). However, information about the $y_i$ values is obtainable only via interaction
with an oracle, defined as follows.

At any time, the learning algorithm may propose a label $\ell \in \mathcal{Y}$ and a finite subsequence of
unlabeled examples $S = \{x_{i_1}, \ldots, x_{i_m}\}$ (for any $m \in \mathbb{N}$); if $y_{i_j} \ne \ell$ for all $j \le m$, the oracle
returns "none." Otherwise, the oracle selects an arbitrary $x_{i_j} \in S$ for which $y_{i_j} = \ell$ and returns the
pair $(x_{i_j}, y_{i_j})$. In the following we call this model the CCQ (class-conditional queries) interactive
learning model. Technically, we implicitly suppose the set $S$ also specifies the unique indices of the
examples it contains, so that the oracle knows which $y_i$ corresponds to which $x_{i_j}$ in the sample $S$;
however, we make this detail implicit below to simplify the presentation.
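To make the query protocol concrete, the following is a minimal Python sketch of a class-conditional query oracle over a finite pool. It is an illustration under stated assumptions, not part of the formal model: the class name `CCQOracle`, the representation of examples by their indices, and the use of a seeded random choice to stand in for the oracle's "arbitrary" selection are all hypothetical.

```python
import random
from typing import Optional, Sequence, Tuple


class CCQOracle:
    """Toy class-conditional query oracle over a fixed labeled pool.

    Hypothetical helper: the hidden labels live in `self._labels`, and the
    learner only ever sees what `query` returns.
    """

    def __init__(self, labels: Sequence[int], seed: int = 0) -> None:
        self._labels = list(labels)      # y_i, hidden from the learner
        self._rng = random.Random(seed)  # stands in for the "arbitrary" choice
        self.num_queries = 0             # running count of queries made

    def query(self, label: int, indices: Sequence[int]) -> Optional[Tuple[int, int]]:
        """Return (i, y_i) for some i in `indices` with y_i == label, else None."""
        self.num_queries += 1
        matches = [i for i in indices if self._labels[i] == label]
        if not matches:
            return None  # the oracle answers "none"
        i = self._rng.choice(matches)
        return i, self._labels[i]


oracle = CCQOracle(labels=[1, 3, 2, 3, 1])
hit = oracle.query(label=3, indices=[0, 1, 2, 3])  # some index whose label is 3
miss = oracle.query(label=2, indices=[0, 1])       # no example labeled 2 in this subset
```

In this sketch the query complexity of a learner is simply the final value of `oracle.num_queries`.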

In the analysis below, we fix a set of classifiers $h : \mathcal{X} \to \mathcal{Y}$ called the hypothesis class,
denoted $C$. We will denote by $d$ the Natarajan dimension of $C$ (Natarajan, 1989; Haussler and
Long, 1995; Ben-David, Cesa-Bianchi, Haussler, and Long, 1995), defined as the largest $m \in \mathbb{N}$
such that $\exists (a_1, b_1, c_1), \ldots, (a_m, b_m, c_m) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Y}$ such that $\{b_1, c_1\} \times \cdots \times \{b_m, c_m\} \subseteq
\{(h(a_1), \ldots, h(a_m)) : h \in C\}$. The Natarajan dimension has been calculated for a variety of
hypothesis classes, and is known to be related to several other commonly used dimensions, including
the pseudo-dimension and graph dimension (Haussler and Long, 1995; Ben-David, Cesa-Bianchi,
Haussler, and Long, 1995). For instance, for neural networks of $n$ nodes with weights given by $b$-bit
integers, the Natarajan dimension is at most $bn(n-1)$ (Natarajan, 1989).

For any $h : \mathcal{X} \to \mathcal{Y}$ and distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, define the error rate of $h$ as $\mathrm{err}_P(h) =
\mathbb{P}_{(x,y) \sim P}\{h(x) \ne y\}$; when $P = \mathcal{D}_{XY}$, we abbreviate this as $\mathrm{err}(h)$. For any finite sequence of
labeled examples $L = \{(x_{i_1}, y_{i_1}), \ldots, (x_{i_m}, y_{i_m})\}$, we define the empirical error rate $\mathrm{err}_L(h) =
|L|^{-1} \sum_{(x,y) \in L} \mathbb{I}[h(x) \ne y]$. In some contexts, we also refer to the empirical error rate on a finite
sequence of unlabeled examples $U = \{x_{i_1}, \ldots, x_{i_m}\}$, in which case we simply define $\mathrm{err}_U(h) =
|U|^{-1} \sum_{x_{i_j} \in U} \mathbb{I}[h(x_{i_j}) \ne y_{i_j}]$, where the $y_{i_j}$ values are the actual labels of these examples.

Let $h^*$ be the classifier in $C$ of smallest $\mathrm{err}(h^*)$ (for simplicity, we suppose the minimum is
always realized), and let $\eta = \mathrm{err}(h^*)$, called the noise rate. The objective of the learning algorithm
is to identify some $h$ with $\mathrm{err}(h)$ close to $\eta$ using only a small number of queries. In this context, a
learning algorithm is simply any algorithm that makes some number of queries and then halts and
returns a classifier. We are particularly interested in the following quantity.

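Definition 1 For any $\epsilon, \delta \in (0, 1)$, any hypothesis class $C$, and any family of distributions $\mathbb{D}$ on
$\mathcal{X} \times \mathcal{Y}$, define the quantity $\mathrm{QC}_{\mathrm{CCQ}}(\epsilon, \delta, C, \mathbb{D})$ as the minimum $q \in \mathbb{N}$ such that there exists a
learning algorithm $\mathcal{A}$, which for any target distribution $\mathcal{D}_{XY} \in \mathbb{D}$, with probability at least $1 - \delta$,
makes at most $q$ queries and then returns a classifier $h$ with $\mathrm{err}(h) \le \eta + \epsilon$. We generally refer to
the function $\mathrm{QC}_{\mathrm{CCQ}}(\cdot, \cdot, C, \mathbb{D})$ as the query complexity of learning $C$ under $\mathbb{D}$.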

The query complexity, as defined above, represents a kind of minimax statistical analysis, where
we fix a family of possible target distributions $\mathbb{D}$, and calculate, for the best possible learning
algorithm, how many queries it makes under its worst possible target distribution $\mathcal{D}_{XY}$ in $\mathbb{D}$. Specific
families of target distributions we will be interested in include the random classification noise model,
the bounded noise model, and the agnostic model, which we define formally in the corresponding
sections. In some contexts, we may also discuss the query complexity achieved by a particular
algorithm, in which case it is merely the same definition as above except replacing $\mathcal{A}$ with the particular
algorithm in question.



3. The General Agnostic Case 

We start by considering the most general, agnostic setting, where we consider arbitrary noise dis- 
tributions subject to a constraint on the noise rate. This is particularly relevant to many practical 
scenarios, where we often do not know what type of noise we are faced with, potentially including 
stochastic labels or model misspecification, and we would therefore like to refrain from making any 
specific assumptions about the nature of the noise. Formally, the family of distributions we consider 
is $\mathrm{Agnostic}(C, \alpha) = \{\mathcal{D}_{XY} : \inf_{h \in C} \mathrm{err}(h) \le \alpha\}$, for $\alpha \in [0, 1/2)$. In this section we prove nearly
tight upper and lower bounds on the query complexity of our model. Specifically, supposing k is 
constant, we have the following theorem. 

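Theorem 2 For any hypothesis class $C$ of Natarajan dimension $d$, for any $\eta \in [0, 1/32)$,

$$\mathrm{QC}_{\mathrm{CCQ}}(\epsilon, \delta, C, \mathrm{Agnostic}(C, \eta)) = \tilde{\Theta}\left(d \frac{\eta^2}{\epsilon^2}\right).$$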

The first interesting thing is that our bound differs from the sample complexity of passive
learning only in a factor of $\eta$. This contrasts with the realizable case, where it is possible to learn with a
query complexity that is exponentially smaller than the query complexity of passive learning. On the
other hand, it is also interesting that this factor of $\eta$ is consistently available regardless of the structure
of the concept space. This contrasts with active learning, where the extra factor of $\eta$ is only available
in certain special cases (Hanneke, 2007a).



3.1 Proof of the Lower Bound 

We first prove the lower bound. We specifically prove that for $0 < 2\epsilon < \eta < 1/4$,

$$\mathrm{QC}_{\mathrm{CCQ}}(\epsilon, 1/4, C, \mathrm{Agnostic}(C, \eta)) = \Omega\left(d \frac{\eta^2}{\epsilon^2}\right).$$

Monotonicity in $\delta$ extends this to any $\delta \in (0, 1/4]$.

Proof The key idea of the proof is to provide a reduction from the (binary) active learning model
(label request queries) to our multiclass interactive learning model (general class-conditional queries)
for the hard case known previously in the literature for the active learning model (Beygelzimer,
Dasgupta, and Langford, 2009).

In particular, consider a set of $d$ points $x_0, x_1, x_2, \ldots, x_{d-1}$ shattered by $C$, and let $(y_0, z_0),
\ldots, (y_{d-1}, z_{d-1})$ be the label pairs that witness the shattering. Here is a distribution over $\mathcal{X} \times \mathcal{Y}$:
point $x_0$ has probability $1 - \beta$, while each of the remaining $x_i$ has probability $\beta/(d-1)$, where
$\beta = 2(\eta + 2\epsilon)$. At $x_0$ the response is always $Y = y_0$. At $x_i$, $1 \le i \le d-1$, the response is $Y = z_i$
with probability $1/2 + \gamma b_i$ and $Y = y_i$ with probability $1/2 - \gamma b_i$, where $b_i$ is either $+1$ or $-1$, and
$\gamma = 2\epsilon/\beta = \epsilon/(\eta + 2\epsilon)$.

Beygelzimer, Dasgupta, and Langford (2009) show that for any active learning algorithm, one
can set $b_0 = 1$ and all the $b_i$, $i \in \{1, \ldots, d-1\}$, in a certain way so that the algorithm must make
$\Omega(d\eta^2/\epsilon^2)$ queries in order to output a classifier of error at most $\eta + \epsilon$ with probability at least $1/2$.
Building on this, we can show any interactive learning algorithm seeking a classifier of error at most
$\eta + \epsilon$ must make $\Omega(d\eta^2/\epsilon^2)$ queries to succeed with probability at least $1/2$.

Assume that we have an algorithm $\mathcal{A}$ that works for the CCQ model with query complexity
$\mathrm{QC}_{\mathrm{CCQ}}(\epsilon, \delta, C, \mathrm{Agnostic}(C, \eta))$. We show how to use $\mathcal{A}$ as a subroutine in an active learning
algorithm that is specifically tailored to the above hard set of distributions.

In particular, we can simulate an oracle for the CCQ algorithm as follows. Suppose our CCQ
algorithm queries with a set $S_i$ for a label $\ell$. If $\ell$ is not one of the $y_0, \ldots, y_{d-1}, z_0, \ldots, z_{d-1}$ labels,
we may immediately return that none exist. If there exists $x_{i,j} \in S_i$ such that $x_{i,j} = x_0$ and
$\ell = z_0$, then we may simply return to the algorithm this $(x_{i,j}, z_0)$. Otherwise, we need only make
(in expectation) $\frac{1}{1/2 - \gamma}$ active learning queries to respond to the class-conditional query, as follows.
We consider the subset $R_i$ of $S_i$ of points $x_{i,j}$ among those $x_j$ with $\ell \in \{y_j, z_j\}$. We pick an
example $x_i^{(1)}$ at random in $R_i$ and request its label $y_i^{(1)}$. If $x_i^{(1)}$ has label $y_i^{(1)} = \ell$, then we return
to the algorithm $(x_i^{(1)}, y_i^{(1)})$; otherwise, we continue sampling random $x_i^{(2)}, x_i^{(3)}, \ldots$ points from
$R_i$ (whose labels have not yet been requested) and requesting their labels $y_i^{(2)}, y_i^{(3)}, \ldots$, until we
find one with label $\ell$, at which point we return to the algorithm that example. If we exhaust $R_i$
without finding such an example, we return to the algorithm that no such point exists. Since each
$x_{i,j} \in R_i$ has probability at least $1/2 - \gamma$ of having $y_{i,j} = \ell$, we can answer any query of $\mathcal{A}$ using
in expectation no more than $\frac{1}{1/2 - \gamma}$ label request queries.
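The simulation step described above can be sketched in Python as follows. This is a minimal illustration, assuming a hypothetical label-request callback `request_label` and representing the candidate set (the role played by $R_i$) by a list of indices; it is not the authors' implementation, and it omits the special handling of the point $x_0$.

```python
import random
from typing import Callable, Optional, Sequence, Tuple


def answer_ccq_with_label_requests(
    label: int,
    candidates: Sequence[int],
    request_label: Callable[[int], int],
    rng: random.Random,
) -> Tuple[Optional[Tuple[int, int]], int]:
    """Answer one class-conditional query using only label requests.

    `candidates` plays the role of R_i (indices whose label could be `label`);
    `request_label(i)` is a hypothetical label-request oracle returning y_i.
    Returns (answer, number_of_label_requests); answer is None for "none".
    """
    order = list(candidates)
    rng.shuffle(order)  # visit R_i in uniformly random order, without replacement
    used = 0
    for i in order:
        used += 1
        y = request_label(i)
        if y == label:
            return (i, y), used  # found an example with the requested label
    return None, used  # R_i exhausted: report that no such point exists


hidden = {0: 2, 1: 1, 2: 2, 3: 1}  # toy hidden labels y_i
answer, cost = answer_ccq_with_label_requests(
    1, [0, 1, 2, 3], hidden.__getitem__, random.Random(0))
none_answer, none_cost = answer_ccq_with_label_requests(
    3, [0, 2], hidden.__getitem__, random.Random(0))
```

The second return value is the quantity whose expectation the argument bounds by $\frac{1}{1/2-\gamma}$ when each candidate carries the requested label with probability at least $1/2 - \gamma$.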

In particular, we can upper bound this number of queries by a geometric random variable and
apply concentration inequalities for geometric random variables to bound the total number of label
requests, as follows. Let $A_i$ be a random variable indicating the actual number of label requests
we make to answer query number $i$ in the reduction above, before returning a response.
For $j \le A_i$, if $h^*(x_i^{(j)}) \ne \ell$, let $Z_j = \mathbb{I}[y_i^{(j)} = \ell]$, and if $h^*(x_i^{(j)}) = \ell$, let $C_j$ be an
independent Bernoulli$((1/2 - \gamma)/(1/2 + \gamma))$ random variable, and let $Z_j = C_j \mathbb{I}[y_i^{(j)} = \ell]$. For
$j > A_i$, let $Z_j$ be an independent Bernoulli$(1/2 - \gamma)$ random variable. Let $B_i = \min\{j : Z_j = 1\}$.

Since, $\forall j \le A_i$, $Z_j \le \mathbb{I}[y_i^{(j)} = \ell]$, we clearly have $B_i \ge A_i$. Furthermore, note that the $Z_j$
are independent Bernoulli$(1/2 - \gamma)$ random variables, so that $B_i$ is a Geometric$(1/2 - \gamma)$ random
variable. By Lemma 13 in Appendix A, we obtain that if $Q$ is any constant and $\mathcal{A}$ makes $\le Q$ queries,
then with probability at least $3/4$,
$\sum_{i=1}^{Q} A_i \le \sum_{i=1}^{Q} B_i \le \frac{2}{1/2 - \gamma}\left[Q + 4\ln(4)\right]$. Thus, since $\sum_{i=1}^{Q} A_i$ represents the total number of
label requests made by this algorithm, and we know that with probability at least $3/4$ the number
of queries is at most $Q = \mathrm{QC}_{\mathrm{CCQ}}(\epsilon, 1/4, C, \mathrm{Agnostic}(C, \eta))$, combining this together with the
aforementioned (Beygelzimer, Dasgupta, and Langford, 2009) lower bound for active learning, we
obtain the result. ■



3.2 Upper bound 

In this section we describe an algorithm whose query complexity is $\tilde{O}\left(kd \frac{\eta^2}{\epsilon^2}\right)$. For clarity, we start
by considering the case where we know an upper bound $\beta$ on $\eta$. This procedure (Algorithm 1)
has two phases: in Phase 1 it uses a robust version of the classic halving algorithm to produce a
classifier whose error rate is at most $10(\beta + \epsilon)$ by only using $\tilde{O}\left(kd \log \frac{1}{\epsilon}\right)$ queries. In Phase 2 we
run a refining algorithm that uses $\tilde{O}\left(kd \frac{\eta^2}{\epsilon^2}\right)$ queries to turn the classifier output in phase one into a
classifier of error $\eta + \epsilon$. We will discuss how to remove the assumption of knowing an upper bound
$\beta$ on $\eta$, adapting to $\eta$, later in Section 3.2.

Algorithm 1 General Agnostic Interactive Algorithm

Input: The sequence $(x_1, x_2, \ldots)$; values $u$, $s$, $\delta$; budget $n$ (optional; default value $= \infty$).

1. Let $V$ be a (minimal) $\epsilon$-cover of the space of classifiers $C$ with respect to $\mathcal{D}_X$. Let $U$ be $\{x_1, \ldots, x_u\}$.

2. Run the Generalized Halving Algorithm (Phase 1) with input $U$, $V$, $s$, $c \ln \frac{4 \log_2 |V|}{\delta}$, $n/2$, and get $h$
returned.

3. Run the Refining Algorithm (Phase 2) with input $U$, $h$, $n/2$, and get labeled sample $L$ returned.

4. Find a hypothesis $h' \in V$ of minimum $\mathrm{err}_L(h')$.
Output Hypothesis $h'$ (and $L$).

Before presenting and analyzing the main steps of our algorithm, we start by describing a useful
definition and a useful subroutine (Subroutine 1, Find-Mistake). Given $V \subseteq C$, we define the
plurality vote classifier as

$$\mathrm{plur}(V)(x) = \mathop{\mathrm{argmax}}_{y \in \mathcal{Y}} \sum_{h \in V} \mathbb{I}[h(x) = y].$$



Subroutine 1 Find-Mistake

Input: The sequence $S = (x_1, x_2, \ldots, x_m)$; classifier $h$

1. For each $y \in \{1, \ldots, k\}$,

(a) Query the set $\{x \in S : h(x) \ne y\}$ for label $y$

(b) If received back an example $(x, y)$, return $(x, y)$

2. Return "none"

Note that, if $\mathrm{err}_S(h) > 0$, then Find-Mistake returns a labeled example $(x, y)$ with $y$ the true
label of $x$, such that $h(x) \ne y$, and otherwise it returns an indication that no such point exists.
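As an illustration, Find-Mistake can be sketched in Python as below. The sketch assumes a hypothetical `query(label, points)` callback implementing a class-conditional query (returning a pair `(x, y)` or `None`), and uses a toy pool in which each point is identified with its index; these names are not from the paper.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical type of a class-conditional query: (label, points) -> (x, y) or None.
Query = Callable[[int, Sequence[int]], Optional[Tuple[int, int]]]


def find_mistake(S: Sequence[int], h: Callable[[int], int],
                 k: int, query: Query) -> Optional[Tuple[int, int]]:
    """Return a labeled example (x, y) with h(x) != y, or None if h is correct on S."""
    for y in range(1, k + 1):
        # Ask for a point of true label y among the points that h does NOT label y.
        disagree = [x for x in S if h(x) != y]
        answer = query(y, disagree)
        if answer is not None:
            return answer
    return None  # corresponds to returning "none"


hidden: Dict[int, int] = {0: 1, 1: 2, 2: 3}  # toy true labels


def toy_query(label: int, points: Sequence[int]) -> Optional[Tuple[int, int]]:
    """Toy oracle: return the first queried point whose hidden label matches."""
    for x in points:
        if hidden[x] == label:
            return x, label
    return None


mistake = find_mistake([0, 1, 2], lambda x: 1, k=3, query=toy_query)
no_mistake = find_mistake([0, 1, 2], lambda x: hidden[x], k=3, query=toy_query)
```

Each call makes at most $k$ class-conditional queries, which is where the factor of $k$ in the query bounds of this section comes from.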






Robust Interactive Learning 



Phase 1 Generalized Halving Algorithm

Input: The sequence $U = (x_1, x_2, \ldots, x_{ps})$; set of classifiers $V$; values $s$, $N$; budget $n$ ($n$ optional: default
value $= \infty$).

1. Set $b = $ true, $t = 0$.

2. while ($b$ and $t < n - N$)

(a) Draw $S_1, S_2, \ldots, S_N$ of size $s$ uniformly without replacement from $U$.

(b) For each $i$, call Find-Mistake with arguments $S_i$ and $\mathrm{plur}(V)$. If it returns a mistake, we record
the mistake $(\tilde{x}_i, \tilde{y}_i)$ it returns.

(c) If Find-Mistake finds a mistake in more than $N/3$ of the sets, remove from $V$ every $h \in V$ making
mistakes on $> N/9$ examples $(\tilde{x}_i, \tilde{y}_i)$, and set $t \leftarrow t + N$; else $b \leftarrow 0$.

Output Hypothesis $\mathrm{plur}(V)$.
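The Generalized Halving phase can be sketched in Python as follows. This is a hedged illustration only: the helper names, the tie-breaking rule in the plurality vote (smallest label), the toy threshold classifiers, and the toy oracle are all assumptions made for the example, and the function returns the surviving set $V$ so that the caller can form $\mathrm{plur}(V)$.

```python
import random
from collections import Counter


def plurality(V, x):
    """Plurality vote of the classifiers in V at x (ties go to the smallest label)."""
    counts = Counter(h(x) for h in V)
    top = max(counts.values())
    return min(y for y, c in counts.items() if c == top)


def find_mistake(S, h, k, query):
    """Return (x, y) with h(x) != y found via class-conditional queries, else None."""
    for y in range(1, k + 1):
        answer = query(y, [x for x in S if h(x) != y])
        if answer is not None:
            return answer
    return None


def generalized_halving(U, V, s, N, k, query, rng, budget=float("inf")):
    """Sketch of Phase 1: prune V until plur(V) rarely errs on random subsamples."""
    V = list(V)
    active, t = True, 0
    while active and V and t < budget - N:
        def vote(x, V=V):
            return plurality(V, x)
        mistakes = []
        for _ in range(N):
            S = rng.sample(U, s)  # s points drawn without replacement from U
            m = find_mistake(S, vote, k, query)
            if m is not None:
                mistakes.append(m)
        if len(mistakes) > N / 3:
            # Drop every classifier that errs on more than N/9 recorded mistakes.
            V = [h for h in V if sum(h(x) != y for x, y in mistakes) <= N / 9]
            t += N
        else:
            active = False
    return V


def true_label(x):
    return 1 if x < 15 else 2


def make_threshold(c):
    return lambda x: 1 if x < c else 2


def toy_query(y, S):
    """Toy noiseless oracle: first queried point whose true label is y."""
    for x in S:
        if true_label(x) == y:
            return x, y
    return None


classifiers = [make_threshold(c) for c in range(21)]
survivors = generalized_halving(list(range(20)), classifiers, s=5, N=9, k=2,
                                query=toy_query, rng=random.Random(0))
```

In this noiseless toy run the classifier with threshold 15 never errs on a returned example, so it is never removed, mirroring the first claim of Lemma 3.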
Phase 2 Refining Algorithm

Input: The sequence $U = (x_1, x_2, \ldots, x_{ps})$; classifier $h$; budget $n$ ($n$ optional: default value $= \infty$).

1. Set $b = 1$, $t = 0$, $W = U$, $L = \emptyset$.

2. while ($b$ and $t < n$)

(a) Call Find-Mistake with arguments $W$ and $h$.

(b) If it returns a mistake $(\tilde{x}, \tilde{y})$, then set $L \leftarrow L \cup \{(\tilde{x}, \tilde{y})\}$, $W \leftarrow W \setminus \{\tilde{x}\}$, and $t \leftarrow t + 1$.

(c) Else set $b = 0$ and $L \leftarrow L \cup \{(x, h(x)) : x \in W\}$.

Output Labeled sample $L$.
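A minimal Python sketch of the Refining phase is given below, under the same illustrative assumptions as before: a hypothetical `query(label, points)` callback, points identified with integers, and a toy noiseless oracle that also counts the queries it receives.

```python
def find_mistake(S, h, k, query):
    """Return (x, y) with h(x) != y found via class-conditional queries, else None."""
    for y in range(1, k + 1):
        answer = query(y, [x for x in S if h(x) != y])
        if answer is not None:
            return answer
    return None


def refine(U, h, k, query, budget=float("inf")):
    """Sketch of Phase 2: collect the mistakes of h on U, then label the rest by h."""
    W = list(U)  # points whose labels are still unknown
    L = []       # labeled sample under construction
    active, t = True, 0
    while active and t < budget:
        answer = find_mistake(W, h, k, query)
        if answer is not None:
            x, y = answer
            L.append((x, y))  # label supplied by the oracle
            W.remove(x)
            t += 1
        else:
            active = False
            L.extend((x, h(x)) for x in W)  # oracle certified h on all of W
    return L


def true_label(x):
    return 1 if x < 7 else 2


def h(x):
    return 1 if x < 5 else 2  # errs exactly on x = 5 and x = 6


calls = {"queries": 0}


def toy_query(y, S):
    """Toy noiseless oracle that also counts how many queries it receives."""
    calls["queries"] += 1
    for x in S:
        if true_label(x) == y:
            return x, y
    return None


U = list(range(10))
L_full = refine(U, h, k=2, query=toy_query)
queries_used = calls["queries"]
L_cut = refine(U, h, k=2, query=toy_query, budget=1)
```

With an unlimited budget the returned sample carries the true label of every point of `U`; with `budget=1` the loop stops after one recorded mistake, so the returned sample is shorter than `U`, which is the signal used later when adapting to the noise rate.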



Lemma 3 below characterizes the performance of Phase 1 and Lemma 4 characterizes the
performance of Phase 2. Note that the budget parameter in these methods is only utilized in our later
discussion of adaptation to the noise rate.

Lemma 3 Assume that some $h \in V$ has $\mathrm{err}_U(h) \le \beta$ for $\beta \in [0, 1/32]$. With probability $\ge 1 - \delta/2$,
running Phase 1 with $U$, and values $s = \left\lfloor \frac{1}{16\beta} \right\rfloor$ and $N = c \ln \frac{4 \log_2 |V|}{\delta}$ (for an appropriate constant
$c \in (0, \infty)$), we have that for every round of the loop of Step 2, the following hold.

• $h$ makes mistakes on at most $N/9$ of the returned $(\tilde{x}_i, \tilde{y}_i)$ examples.

• If $\mathrm{err}_U(\mathrm{plur}(V)) > 10\beta$, then Find-Mistake returns a mistake for $\mathrm{plur}(V)$ on $> N/3$ of the sets.

• If Find-Mistake returns a mistake for $\mathrm{plur}(V)$ on $> N/3$ of the sets $S_i$, then the number of classifiers in
$V$ making mistakes on $> N/9$ of the returned $(\tilde{x}_i, \tilde{y}_i)$ examples in Step 2(b) is at least $(1/4)|V|$.



Proof Phase 1 and Lemma 3 are inspired by the analysis of Hanneke (2007b). In the following, by
a noisy example we mean any $x_i$ such that $h(x_i) \ne y_i$. The expected number of noisy points in any
given set $S_i$ is at most $1/16$, which (by Markov's inequality) implies the probability $S_i$ contains a
noisy point is at most $1/16$. Therefore, the expected number of sets $S_i$ with a noisy point in them is
at most $N/16$, so by a Chernoff bound, with probability at least $1 - \delta/(4 \log_2 |V|)$ we have that at
most $N/9$ sets $S_i$ contain any noisy point, establishing claim 1.






Balcan and Hanneke 



Assume that $\mathrm{err}_U(\mathrm{plur}(V)) > 10\beta$. The probability that there is a point $x_i$ in $S_i$ such that
$\mathrm{plur}(V)$ labels $x_i$ differently from $y_i$ is $\ge 1 - (1 - 10\beta)^s > .37$ (discovered by direct optimization).
So (for an appropriate value of $c > 0$ in $N$) by a Chernoff bound, with probability at least $1 -
\delta/(4 \log_2 |V|)$, at least $N/3$ of the sets $S_i$ contain a point $x_i$ such that $\mathrm{plur}(V)(x_i) \ne y_i$, which
establishes claim 2. Via a combinatorial argument, this then implies with probability at least $1 -
\delta/(4 \log_2 |V|)$, at least $|V|/4$ of the hypotheses will make mistakes on more than $N/9$ of the sets $S_i$.
To see this consider the bipartite graph where on the left hand side we have all the classifiers in $V$
and on the right hand side we have all the returned $(\tilde{x}_i, \tilde{y}_i)$ examples. Let us put an edge between a
node $i$ on the left and a node $j$ on the right if the hypothesis $h_i$ associated to node $i$ makes a mistake
on $(\tilde{x}_j, \tilde{y}_j)$. Let $M$ be the number of vertices in the right hand side. Clearly, the total number of
edges in the graph is at least $(1/2)|V|M$, since at most $|V|/2$ classifiers label $\tilde{x}_i$ as $\tilde{y}_i$. Let $\alpha|V|$
be the number of classifiers in $V$ that make mistakes on at most $N/9$ $(\tilde{x}_i, \tilde{y}_i)$ examples. The total
number of edges in the graph is then upper bounded by $\alpha|V|N/9 + (1 - \alpha)|V|M$. Therefore,

$$(1/2)|V|M \le \alpha|V|N/9 + (1 - \alpha)|V|M,$$

which implies

$$|V|M(\alpha - 1/2) \le \alpha|V|N/9.$$

Applying the lower bound $M \ge N/3$, we get $(N/3)|V|(\alpha - 1/2) \le \alpha|V|N/9$, so $\alpha \le 3/4$. This
establishes claim 3.

A union bound over the above two events, as well as over the iterations of the loop (of which
there are at most $\log_2 |V|$ due to the third claim of this lemma) obtains the claimed overall $1 - \delta/2$
probability. ■
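The constant $.37$ used in the proof above can be checked numerically. The short Python sketch below assumes the subsample size is $s = \lfloor 1/(16\beta) \rfloor$ (this reading of the garbled source is an assumption) and evaluates $1 - (1 - 10\beta)^s$ on a fine grid of $\beta \in (0, 1/32]$; the grid resolution is an arbitrary choice.

```python
import math


def detection_lower_bound(beta: float) -> float:
    """Lower bound 1 - (1 - 10*beta)^s on the chance a subsample exposes a mistake.

    Assumes s = floor(1 / (16 * beta)), the subsample size read off Lemma 3.
    """
    s = math.floor(1.0 / (16.0 * beta))
    return 1.0 - (1.0 - 10.0 * beta) ** s


# Evaluate on a fine grid of beta values in (0, 1/32].
grid = [(1.0 / 32.0) * i / 100000 for i in range(1, 100001)]
worst = min(detection_lower_bound(b) for b in grid)
```

On this grid the smallest value occurs for $\beta$ just above $1/48$, where $s = 2$, which is consistent with the bound being found by direct optimization rather than a closed form.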

Lemma 4 Suppose some $h$ has $\mathrm{err}_U(h) \le \beta$, for some $\beta \in [0, 1/32]$. Running Phase 2 with
parameters $U$, $h$, and any budget $n$, if $L$ is the returned sample, and $|L| = |U|$, then every $(x_i, y) \in
L$ has $y = y_i$ (i.e., the labels are in agreement with the oracle's labels); furthermore, $|L| = |U|$
definitely happens for any $n \ge \beta|U| + 1$.

Proof Every call to Find-Mistake returns a new mistake for $h$ from $U$, except the last call, and
since there are only $\beta|U|$ such mistakes, the procedure requires only $\beta|U| + 1$ calls to Find-Mistake.
Furthermore, every label was either given to us by the oracle, or was assigned at the end, and in this
latter case the oracle has certified that they are correct.

Formally, if $|L| = |U|$, then either every $x \in U$ was returned as some $(x, y)$ pair in Step 2.b, or
we reached Step 2.c. In the former case, these $y$ labels are the oracle's actual responses, and thus
correspond to the true labels. In the latter case, every element of $L$ added prior to reaching 2.c was
returned by the oracle, and is therefore the true label. Every element $(x_i, y) \in L$ added in Step 2.c
has label $h(x_i)$, which the oracle has just told us is correct in Find-Mistake (meaning we definitely
have $h(x_i) = y_i$). Thus, in either case, the labels are in agreement with the true labels. Finally, note
that each call to Find-Mistake either returns a mistake for $h$ we have not previously received, or is
the final such call. Since there are at most $\beta|U|$ mistakes in total, we can have at most $\beta|U| + 1$
calls to Find-Mistake. ■

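We are now ready to present our main upper bounds for the agnostic noise model.

Theorem 5 Suppose $\beta \ge \eta$, and $\beta + \epsilon \le 1/32$. Running Algorithm 1 with parameters $u =
O\left(d((\beta + \epsilon)/\epsilon^2) \log(k/(\epsilon\delta))\right)$, $s = \left\lfloor \frac{1}{16(\beta + \epsilon)} \right\rfloor$, and $\delta$, with probability at least $1 - \delta$ it produces a
classifier $h'$ with $\mathrm{err}(h') \le \eta + \epsilon$ using a number of queries
$O\left(kd \frac{(\beta + \epsilon)^2}{\epsilon^2} \log \frac{1}{\epsilon\delta} + kd \log \frac{\log(1/\epsilon)}{\delta} \log \frac{1}{\epsilon}\right)$.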



Proof We have chosen $u$ large enough so that $\mathrm{err}_U(h^*) \le \eta + \epsilon \le \beta + \epsilon$, with probability at least
$1 - \delta/4$, by a (multiplicative) Chernoff bound. By Lemma 3, we know that with probability $1 - \delta/2$,
$h^*$ is never discarded in Step 2(c) in Phase 1, and as long as $\mathrm{err}_U(\mathrm{plur}(V)) > 10(\beta + \epsilon)$, then we
cut the set $|V|$ by a constant factor. So, with probability $1 - 3\delta/4$, after at most $O(kN \log(|V|))$
queries, Phase 1 halts with the guarantee that $\mathrm{err}_U(\mathrm{plur}(V)) \le 10(\beta + \epsilon)$. Thus, by Lemma 4, the
execution of Phase 2 returns a set $L$ with the true labels after at most $(10(\beta + \epsilon)u + 1)k$ queries.

Furthermore, we can choose the $\epsilon$-cover $V$ so that $|V| \le 4(ck^2/\epsilon)^d$ for an appropriate constant
$c$ (van der Vaart and Wellner, 1996; Haussler and Long, 1995).

Therefore, by Chernoff and union bounds, we have chosen $u$ large enough so that the $h'$ of
minimal $\mathrm{err}_U(h')$ has $\mathrm{err}(h') \le \eta + \epsilon$ with probability at least $1 - \delta/4$. Combining the above
events by a union bound, with probability $1 - \delta$, the $h'$ chosen at the conclusion of Algorithm 1 has
$\mathrm{err}(h') \le \eta + \epsilon$ and the total number of queries is at most



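$$kN \log_{4/3}(|V|) + k(10(\beta + \epsilon)u + 1) = O\left(kd \log \frac{\log(1/\epsilon)}{\delta} \log \frac{1}{\epsilon} + kd \frac{(\beta + \epsilon)^2}{\epsilon^2} \log \frac{1}{\epsilon\delta}\right).$$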

In particular, if we take $\beta = \eta$, Theorem 5 implies the upper bound part of Theorem 2.
Note: It is sometimes desirable to restrict the size of the sample we make the query for, so that
the oracle does not need to sort through an extremely large sample searching for a mistake. To this
end, we can run Phase 2 on chunks of size $1/(\eta + \epsilon)$ from $U$, and then union the resulting labeled
samples to form $L$. The number of queries required for this is still bounded by the desired quantity.

In practice, knowledge of an upper bound $\beta$ reasonably close to $\eta$ is typically not available. As
such, it is important to design algorithms that adapt to the unknown value of $\eta$ using only observable
quantities. The following theorem indicates this is possible in our setting, without significant loss
in query complexity.

Theorem 6 There exists an algorithm that is independent of $\eta$ and $\forall \eta \in [0, 1/2)$ achieves query
complexity $\mathrm{QC}_{\mathrm{CCQ}}(\epsilon, \delta, C, \mathrm{Agnostic}(C, \eta)) = \tilde{O}\left(kd \frac{\eta^2}{\epsilon^2}\right)$.

Proof We consider the proof of this theorem in two stages, with the following intuitive motivation.
First, note that if we set the budget parameter $n$ large enough (at roughly $1/k$ times the value of
the query complexity bound of Theorem 5), then the largest value of $\beta$ for which the algorithm
(with parameters as in Theorem 5) produces $L$ with $|L| = u$ has $\beta \ge \eta$, so that it produces $h'$ with
$\mathrm{err}(h') \le \eta + \epsilon$. So for a given budget $n$, we can simply run the algorithm for each $\beta$ value in a
log-scale grid of $[\epsilon, 1]$, and take the $h'$ for the largest such $\beta$ with $|L| = u$. The second part of the
problem then becomes determining an appropriately large budget $n$, so that this works. For this, we
can simply search for such a value by a guess-and-double technique, where for each $n$ we check
whether it is large enough by evaluating a standard confidence bound on the excess error rate; the
key that allows this to work is that, if $|L| = u$, then the set $L$ is an i.i.d. $\mathcal{D}_{XY}$-distributed sequence
of labeled examples, so that we can use known confidence bounds for working with sequences of
random labeled examples. The details of this strategy follow.

Consider values $n_j = 2^j$ for $j \in \mathbb{N}$, and define the following procedure. We can consider
a sequence of values $\eta_i = 2^{1-i}$ for $i \le \log_2(1/\epsilon)$. For each $i = 1, 2, \ldots, \log_2(1/\epsilon)$, we run
Algorithm 1 with parameters

$$u = u_i = O\left(d((\eta_i + \epsilon)/\epsilon^2) \log(k/(\epsilon\delta))\right), \qquad s = s_i = \left\lfloor \frac{1}{16(\eta_i + \epsilon)} \right\rfloor, \qquad \delta_i = \delta/(8 \log_2(1/\epsilon)),$$

and budget parameter $n_j/\log_2(1/\epsilon)$. Let $h_{ji}$ and $L_{ji}$ denote the return values from this execution
of Algorithm 1, and let $h_j$ and $L_j$ denote the values $h_{ji}$ and $L_{ji}$, respectively, for the smallest value
of $i$ for which $|L_{ji}| = u_i$: that is, for which the execution of Phase 2 ran to completion.

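Note that for some $j$ with $n_j = O\left(d \frac{(\eta + \epsilon)^2}{\epsilon^2} \log \frac{1}{\epsilon\delta} \log \frac{1}{\epsilon} + d \log \frac{\log(1/\epsilon)}{\delta} \log^2 \frac{1}{\epsilon}\right)$, Theorem 5
implies that with probability $1 - \delta/4$, every $i \le \lfloor \log_2(1/\eta) \rfloor$ with $|L_{ji}| = u_i$ has $\mathrm{err}(h_{ji}) \le \eta + \epsilon/2$,
and $|L_{ji}| = u_i$ for at least one such $i$ value: namely, $i = \lfloor \log_2(1/\max\{\eta, \epsilon\}) \rfloor$. Thus, $\mathrm{err}(h_j) \le
\eta + \epsilon/2$ for this value of $j$. Let $j^*$ denote this value of $j$, and for the remainder of this subsection
we suppose this high-probability event occurs.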

All that remains is to design a procedure for searching over rij values to find one large enough to 
obtain this error rate guarantee, but not so large as to lose the query complexity guarantee. Toward 
this end, define 



,^ , led , / l2\Lj\p 

41 -V s 



6 



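A result of Vapnik (1998) (except substituting the appropriate quantities for the multiclass case)
implies that with probability at least $1 - \delta/2$,

$$\left| \left( \mathrm{err}_{L_j}(h_j) - \min_{h \in C} \mathrm{err}_{L_j}(h) \right) - \left( \mathrm{err}(h_j) - \mathrm{err}(h^*) \right) \right| \le \mathcal{E}_j.$$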



Consider running the above procedure for $j = 1, 2, 3, \ldots$ in increasing order until we reach the
first value of $j$ for which

$$\mathrm{err}_{L_j}(h_j) - \min_{h \in C} \mathrm{err}_{L_j}(h) + \mathcal{E}_j \le \epsilon.$$

Denote this first value of $j$ as $\hat{j}$. Note that choosing $\hat{j}$ in this way guarantees $\mathrm{err}(h_{\hat{j}}) \le \eta + \epsilon$.

It remains only to bound the value of this $\hat{j}$, so that we may add up the total number of queries
among the executions of our procedure for all values $j \le \hat{j}$. By setting the constants in $u_i$ appropriately,
the sample size $|L_j|$ is large enough so that, for $j = j^*$, a Chernoff bound (to bound
$\mathrm{err}_{L_j}(h^*) > \mathrm{err}_{L_j}(h_j)$) guarantees that with probability $1 - \delta/4$, $\mathcal{E}_j \le \epsilon/4$. Furthermore, we have

$$\mathrm{err}_{L_j}(h_j) - \min_{h \in C}\mathrm{err}_{L_j}(h) \le \mathrm{err}(h_j) - \mathrm{err}(h^*) + \mathcal{E}_j \le \epsilon/2 + \epsilon/4 = (3/4)\epsilon,$$

so that in total $\mathrm{err}_{L_j}(h_j) - \min_{h \in C}\mathrm{err}_{L_j}(h) + \mathcal{E}_j \le (3/4)\epsilon + \epsilon/4 = \epsilon$. Thus, we have $\hat{j} \le j^*$, so
that the total number of queries is less than $2n_{j^*}$.
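The accounting behind the factor of 2 can be sketched as follows: since the budgets double, the budget spent over all stages up to the stopping stage is less than twice the last budget. This is a minimal sketch; `first_passing_budget` and the callable `gap_plus_width` (standing in for the empirical excess error plus the confidence width at stage $j$) are hypothetical names introduced only for illustration.

```python
def first_passing_budget(gap_plus_width, eps, max_j=60):
    """Scan j = 1, 2, ... and return (j_hat, total_budget_spent).

    gap_plus_width(j) is a hypothetical callable returning the empirical
    excess error of h_j on L_j plus the confidence width at stage j.
    """
    spent = 0
    for j in range(1, max_j + 1):
        spent += 2 ** j  # budget n_j = 2**j used at stage j
        if gap_plus_width(j) <= eps:
            return j, spent
    raise RuntimeError("no stage passed the test within max_j")
```

If the test first passes at stage 5, the budget spent is $2 + 4 + 8 + 16 + 32 = 62$, which is below $2 n_5 = 64$.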
Therefore, by a union bound over the above events, with probability $1 - \delta$, the selected $h_{\hat{j}}$ has
$\mathrm{err}(h_{\hat{j}}) \le \eta + \epsilon$, and the total number of queries is less than

2fcn,. =o(dk4 log log i + dk log log^ . 

Thus, not having direct access to the noise rate only increases our query complexity by at most a
logarithmic factor compared to the bound of Theorem 2. ■



4. Bounded Noise 

In this section we study the Bounded noise model (also known as Massart noise), which has been
extensively studied in the statistical learning theory literature (Massart and Nedelec, 2006; Gine
and Koltchinskii, 2006; Hanneke, 2011). This model represents a significantly stronger restriction
on the type of noise. The motivation for bounded noise is that, in some scenarios, we do have an
accurate representation of the target function within our hypothesis class (i.e., the model is correctly
specified), but we allow for nature's labels to be slightly randomized. Formally, the family of
distributions we consider is $\mathrm{BN}(C, \alpha) = \{\mathcal{D}_{XY} : \exists h^* \in C \text{ s.t. } \mathbb{P}_{\mathcal{D}_{XY}}(Y \ne h^*(x) \mid X = x) \le \alpha\}$ for
$\alpha \in [0, 1/2)$. In some cases, we are interested in the special case of Random Classification Noise,
defined as $\mathrm{RCN}(C, \alpha) = \{\mathcal{D}_{XY} : \exists h^* \in C \text{ s.t. } \forall \ell \ne h^*(x), \mathbb{P}_{\mathcal{D}_{XY}}(Y = \ell \mid X = x) = \alpha/(k-1)\}$.
We will also discuss $\mathrm{BN}(C, \alpha; \mathcal{D}_X)$ and $\mathrm{RCN}(C, \alpha; \mathcal{D}_X)$ as those $\mathcal{D}_{XY}$ in these respective classes
having marginal $\mathcal{D}_X$ on $\mathcal{X}$.

In this section we show a lower bound on the query complexity of interactive learning with class- 
conditional queries as a function of the query complexity of active learning (label request queries). 
The proof follows via a reduction from the (multiclass) active learning model (label request queries) 
to our interactive learning model (general class-conditional queries), very similar in spirit to the 
reduction given in the proof of the lower bound in Theorem 2.

Theorem 7 Consider any hypothesis class $C$ of Natarajan dimension $d \in (0, \infty)$. For any $\alpha \in
[0, 1/2)$, and any distribution $\mathcal{D}_X$ over $\mathcal{X}$, in the random classification noise model we have the
following relationship between the query complexity of interactive learning in the class-conditional
queries model and the query complexity of active learning with label requests:

$$\frac{\alpha}{2(k-1)}\,\mathrm{QC}_{\mathrm{AL}}(\epsilon, 2\delta, C, \mathrm{RCN}(C, \alpha; \mathcal{D}_X)) - 4\ln\frac{1}{\delta} \le \mathrm{QC}_{\mathrm{CCQ}}(\epsilon, \delta, C, \mathrm{RCN}(C, \alpha; \mathcal{D}_X)).$$

Proof The proof follows via a reduction from the active learning model (label request queries) to our
interactive learning model (general class-conditional queries). Assume that we have an algorithm
that works for the CCQ model with query complexity $\mathrm{QC}_{\mathrm{CCQ}}(\epsilon, \delta, C, \mathrm{RCN}(C, \alpha; \mathcal{D}_X))$. We can
convert this into an algorithm that works in the active learning model with a query complexity of
$\mathrm{QC}_{\mathrm{AL}}(\epsilon, 2\delta, C, \mathrm{RCN}(C, \alpha; \mathcal{D}_X)) = \frac{2(k-1)}{\alpha}\left[\mathrm{QC}_{\mathrm{CCQ}}(\epsilon, \delta, C, \mathrm{RCN}(C, \alpha; \mathcal{D}_X)) + 4\ln\frac{1}{\delta}\right]$, as follows.
When our CCQ algorithm queries the $i$-th time, say querying for a label $y$ among a set $S_i$, we pick
an example $x_{i,1}$ at random in $S_i$ and (if the label of $x_{i,1}$ has never previously been requested) we
request its label $y_{i,1}$. If $y = y_{i,1}$, then we return $(x_{i,1}, y_{i,1})$ to the algorithm, and otherwise we keep
taking examples $(x_{i,2}, x_{i,3}, \ldots)$ at random in the set $S_i$ and (if their label has not yet been requested)
requesting their labels $(y_{i,2}, y_{i,3}, \ldots)$, until we find one with label $y$, at which point we return this
labeled example to the algorithm. If we exhaust $S_i$ and we find no example of label $y$, we return to the
algorithm that there are no examples in $S_i$ with label $y$.
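The simulation step described above can be sketched in Python. This is only an illustration of how a single class-conditional query might be answered with individual label requests; the function name `answer_ccq`, the label cache `known_labels`, and the oracle callable `request_label` are hypothetical names, not part of the original construction.

```python
import random

def answer_ccq(S, y, request_label, known_labels, rng):
    """Answer one class-conditional query using individual label requests.

    S: list of example ids; y: queried label; request_label(x): label oracle;
    known_labels: dict caching labels already requested (mutated in place).
    Returns (x, y) for some x in S with label y, or None if S has none.
    """
    order = list(S)
    rng.shuffle(order)  # examine the points of S in uniformly random order
    for x in order:
        if x not in known_labels:
            known_labels[x] = request_label(x)  # one label request
        if known_labels[x] == y:
            return x, y
    return None
```

Caching previously requested labels mirrors the parenthetical condition in the proof: no example is ever charged for more than one label request.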

Let $A_i$ be a random variable indicating the actual number of label requests we make in round $i$
before getting either an example of label $y$ or exhausting the set $S_i$. We also define a related random
variable $B_i$ as follows. For $j \le A_i$, if $h^*(x_{ij}) \ne y$, let $Z_j = \mathbb{1}[y_{ij} = y]$, and if $h^*(x_{ij}) = y$, let $C_j$
be an independent Bernoulli$((\alpha/(k-1))/(1-\alpha))$ random variable, and let $Z_j = C_j \mathbb{1}[y_{ij} = y]$. For
$j > A_i$, let $Z_j$ be an independent Bernoulli$(\alpha/(k-1))$ random variable. Let $B_i = \min\{j : Z_j = 1\}$.
Since, $\forall j \le A_i$, $Z_j \le \mathbb{1}[y_{ij} = y]$, we clearly have $B_i \ge A_i$. Furthermore, note that the $Z_j$ are
independent Bernoulli$(\alpha/(k-1))$ random variables, so that $B_i$ is a Geometric$(\alpha/(k-1))$ random
variable. By Lemma 13 in Appendix A, we obtain that with probability at least $1 - \delta$ we have

$$\sum_i A_i \le \sum_i B_i \le \frac{2(k-1)}{\alpha}\left[\mathrm{QC}_{\mathrm{CCQ}}(\epsilon, \delta, C, \mathrm{RCN}(C, \alpha; \mathcal{D}_X)) + 4\ln\frac{1}{\delta}\right].$$

This then implies

$$\mathrm{QC}_{\mathrm{AL}}(\epsilon, 2\delta, C, \mathrm{RCN}(C, \alpha; \mathcal{D}_X)) \le \frac{2(k-1)}{\alpha}\left[\mathrm{QC}_{\mathrm{CCQ}}(\epsilon, \delta, C, \mathrm{RCN}(C, \alpha; \mathcal{D}_X)) + 4\ln\frac{1}{\delta}\right],$$

which implies the desired result. ■



To complement this lower bound, we prove a related upper bound via an analysis of an algorithm
below, which operates by reducing to a kind of batch-based active learning algorithm. Specifically,
assume that we have an active learning algorithm $A$ that operates as follows. It proceeds in rounds
and in each round it interacts with an oracle by providing a region $R$ of the instance space and a
number $m$, and it expects in return $m$ labeled examples from the conditional distribution given
that $x$ is in $R$. For example, the $A^2$ algorithm (Balcan, Beygelzimer, and Langford, 2006) and the
algorithm of Koltchinskii (2010) can be written to operate this way. We show in the following how
we can use our algorithms from Section 3 in order to provide the desired labeled examples to such
an active learning procedure while using fewer than $m$ queries to our oracle. In the description
below we assume that algorithm $A$ returns its state, a region $R$ of the instance space, a number $m$
of desired samples, a boolean flag $b$ for halting ($b = 0$) or not ($b = 1$), and a classifier $h$.

The value $\delta'$ in this algorithm should be set appropriately depending on the context, essentially
as $\delta$ divided by a coarse bound on the total number of batches the algorithm $A$ will request the labels
of; for our purposes a value $\delta' = \mathrm{poly}(\epsilon\delta(1 - 2\alpha)/d)$ will suffice. To state an explicit bound on the
number of queries used by Algorithm 2, we first review the following definition of Hanneke (2007a,
2009). Recall that for $r > 0$, we define $B(h, r) = \{g \in C : \mathbb{P}_{\mathcal{D}_X}(x : h(x) \ne g(x)) \le r\}$. For any
$\mathcal{H} \subseteq C$, also define the region of disagreement: $\mathrm{DIS}(\mathcal{H}) = \{x \in \mathcal{X} : \exists h, g \in \mathcal{H} \text{ s.t. } h(x) \ne g(x)\}$.
Then define the disagreement coefficient for $h \in C$ as

$$\theta_h(\epsilon) = \sup_{r > \epsilon} \mathbb{P}_{\mathcal{D}_X}(\mathrm{DIS}(B(h, r)))/r.$$

Define the disagreement coefficient of the class $C$ as $\theta(\epsilon) = \sup_{h \in C} \theta_h(\epsilon)$.
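To make the definition concrete, here is a small Python sketch for an example that is not taken from the text above: threshold classifiers $h_t(x) = \mathbb{1}[x \ge t]$ on $[0, 1]$ under the uniform marginal, for which $B(h_t, r)$ contains the thresholds within distance $r$ of $t$ and the disagreement region is the interval between $t - r$ and $t + r$ (clipped to $[0, 1]$). The function names and the grid approximation of the supremum are assumptions for illustration.

```python
def dis_mass_threshold(t, r):
    """P_X(DIS(B(h_t, r))) for thresholds h_t(x) = 1[x >= t], X ~ Uniform[0, 1]."""
    return min(1.0, t + r) - max(0.0, t - r)

def disagreement_coefficient(t, eps, grid=1000):
    """Approximate sup over r > eps of P(DIS(B(h_t, r))) / r on a grid of radii."""
    radii = [eps + (1.0 - eps) * i / grid for i in range(1, grid + 1)]
    return max(dis_mass_threshold(t, r) / r for r in radii)
```

Under these assumptions the coefficient is about 2 for an interior threshold such as $t = 0.5$ and about 1 at the boundary $t = 0$, independent of $\epsilon$.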

Theorem 8 For any concept space $C$ of Natarajan dimension $d$, and any $\alpha \in [0, 1/2)$, for any
distribution $\mathcal{D}_X$ over $\mathcal{X}$,



QCccQ(e, S, C, BN(C, a;Vx)) = O (^(^1 + j^^^^ dk log' 



e5(l-2a). 
Algorithm 2 General Interactive Algorithm for Bounded Noise

Input: The sequence $(x_1, x_2, \ldots)$; allowed error rate $\epsilon$, noise bound $\alpha$, algorithm $A$.

1. Set $b = 1$, $t = 1$. Initialize $A$ and let $S(A)$, $R$, $m$, $b$ and $h$ be the returned values.

2. Let $V$ be a minimal $\epsilon$-cover of $C$ with respect to the distribution $\mathcal{D}_X$.

3. While ($b$)

(a) Letps = |t log ^ and let {xi^ , x^j, . . . , Xi^^_^_^) be the first ps + m points in {xt+i,Xt+2, ■ ■ ■)r\R. 

(b) Run Phase[l]with parameters Ui — {xi^ ,Xi^,. . . , Xi^^ ), V, 
Let h be the returned classifier. 

(c) Run Phase 2 with parameters $U_2 = (x_{i_{ps+1}}, x_{i_{ps+2}}, \ldots, x_{i_{ps+m}})$, $h$.
Let $L$ be the returned labeled sequence.

(d) Run $A$ with parameters $L$ and $S(A)$.
Let $S(A)$, $R$, $m$, $b$ and $h$ be the returned values.

(e) Let $t = i_{ps+m}$.
Output Hypothesis $h$.
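The control flow of the loop above can be sketched in Python. This is a minimal skeleton under stated assumptions, not an implementation of the method: `A_init`, `A_step`, `phase1`, `phase2`, and `region` are hypothetical callables standing in for the batch-based learner and the two phases, and the split size `ps` is passed in as a parameter rather than computed.

```python
def interactive_bounded_noise(stream, A_init, A_step, phase1, phase2, ps):
    """Control flow of the batch reduction; all callables are hypothetical stubs.

    A_init() -> (state, region, m, b, h); A_step(L, state) -> same tuple.
    region(x) -> bool tests membership in R; phase1(U1) -> classifier;
    phase2(U2, classifier) -> list of labeled examples.
    """
    state, region, m, b, h = A_init()
    t = 0  # position in the stream of the next unread point
    while b:
        batch = []
        while len(batch) < ps + m:      # next ps + m stream points lying in R
            x = stream[t]
            t += 1
            if region(x):
                batch.append(x)
        h_hat = phase1(batch[:ps])      # Phase 1 on U1
        L = phase2(batch[ps:], h_hat)   # Phase 2 labels U2
        state, region, m, b, h = A_step(L, state)
    return h
```

The sketch assumes the stream is long enough; a finite list that runs out raises `IndexError` rather than halting gracefully.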



16(q+is) 



, clog 



41og2 \V\ 
5' 



Proof [Sketch] We show that, for $\mathcal{D}_{XY} \in \mathrm{BN}(C, \alpha)$, running Algorithm 2 with the algorithm $A$
of Koltchinskii (2010) returns a classifier $h$ with $\mathrm{err}(h) \le \eta + \epsilon$ using a number of queries as in the
claim.

For bounded noise, with noise bound $\alpha$, on each round of Algorithm 2, we run Phase 1
on a set $U_1$ that, by Hoeffding's inequality and the size of $ps$, with probability $1 - \delta/\log(1/\epsilon)$,
has $\min_{h \in V} \mathrm{err}_{U_1}(h) \le \alpha + \epsilon$. Thus, by Lemma 3, the fraction of examples in each $U_1 =
(x_{i_1}, \ldots, x_{i_{ps}})$ on which the returned $h$ makes a mistake is at most $10(\alpha + \epsilon)$. Then the size of
$ps$ and Hoeffding's inequality imply that $\mathrm{err}(h) \le O(\alpha + \epsilon)$ with probability $1 - \delta/\log(1/\epsilon)$,
and a Chernoff bound implies that Phase 2 is run on a set $U_2$ with $\mathrm{err}_{U_2}(h) \le O(\alpha + \epsilon +
\sqrt{(\alpha + \epsilon)\log(\log(1/\epsilon)/\delta)/m} + \log(\log(1/\epsilon)/\delta)/m)$. Thus, by Lemmas 3 and 4, the number of
queries per round is $O(k(\alpha + \epsilon)m + k\sqrt{(\alpha + \epsilon)m\log(\log(1/\epsilon)/\delta)} + kd\log(d/\epsilon\delta(1 - 2\alpha)))$.

In particular, for the algorithm of lKoltchinskii (l2010l) . it is known that with probability 1 — 5/2, 
every round has m < O ( ifl^^ log ( s(^-2 w ) ' there are at most 0(log(l/e)) rounds, so 



(l-2a)^ ye6[l-2a) 

that the total number of queries is at most O (k {a9{e) + 1) (x-2a)^ ^^^"^ ( eS(i-2a) 



The significance of this result is that $\theta(\epsilon)$ is multiplied by $\alpha$, a feature not present in the known
results for active learning. In a sense, this factor of $\theta(\epsilon)$ is a measure of how difficult the active
learning problem is, as the other terms are inevitable (up to the log factors).

As before, since the value of the noise bound $\alpha$ is typically not known in practice, it is often
desirable to have an algorithm capable of adapting to the value of $\alpha$, while maintaining the query
complexity guarantees of Algorithm 2. Fortunately, we can achieve this by a similar argument to
that used above in Theorem 6. That is, starting with an initial guess of $\hat{\alpha} = \epsilon$ as the noise bound
argument to Algorithm 2, we use the budget argument to Phase 2 to guarantee we never exceed the
query complexity bound of Theorem 8 (with $\hat{\alpha}$ in place of $\alpha$), halting early if ever Phase 2 fails
to label the entire $U_2$ set within its query budget. Then we repeatedly double $\hat{\alpha}$ until finally this
modified Algorithm 2 runs to completion. Setting the budget sizes and $\delta'$ values appropriately, we
can maintain the guarantee of Theorem 8 with only an extra log factor increase.

4.1 Adapting to Unknown $\alpha$

Algorithm 2 is based on having direct access to the noise bound $\alpha$. As in Section 3, since this
information is not typically available in practice, we would prefer a method that can obtain essentially
the same query complexity bounds without direct access to $\alpha$. Fortunately, we can achieve this by a
similar argument to Section 3.2, merely by doubling our guess at the value of $\alpha$ until the algorithm
behaves as expected, as follows.

Consider modifying Algorithm 2 as follows. In Step 3(c), we include the budget argument to
Phase 2, with value $O((1 + \hat{\alpha} m)\log(1/\delta'))$. Then, if the set $L$ returned has $|L| < m$, we
return Failure. Note that if this $\hat{\alpha}$ is at least as large as the actual noise bound, then this bound is
inconsequential, as it will be satisfied anyway (with probability $1 - \delta'$, by a Chernoff bound). Call
this modified method Algorithm 2'.

Now consider the sequence $\hat{\alpha}_i = 2^{i-1}\epsilon$, for $1 \le i \le \log_2(1/\epsilon)$. For $i = 1, 2, \ldots, \log_2(1/\epsilon)$
in increasing order, we run Algorithm 2' with parameters $(x_1, x_2, \ldots)$, $\epsilon$, $\hat{\alpha}_i$, $A$. If the algorithm
runs to completion, we halt and output the $h$ returned by Algorithm 2'. Otherwise, if the algorithm
returns Failure, we increment $i$ and repeat.

Since Algorithm 2' runs to completion for any $i > \lceil \log_2(\alpha/\epsilon) \rceil$, and since the number of queries
Algorithm 2' makes is monotonic in its $\hat{\alpha}$ argument, for an appropriate choice of $\delta' = O(\delta\epsilon^2/d)$
(based on a coarse bound on the total number of batches the algorithm will request labels for),
we have a total number of queries at most O ( (1 + aO{e)) ,-. vi log^ ( ^^/-ilory'i 

) log (i)) for the 



method of Koltchinskii (2010), only a $O(\log(1/\epsilon))$ factor over the bound of Theorem 8; similarly,
we lose at most a factor of $O(\log(1/\epsilon))$ for the splitting method, compared to the bound of Theorem 12.
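The doubling search over guesses of the noise bound can be sketched as follows. This is an illustrative skeleton only: `adapt_to_noise_bound` and the callable `run_with_guess` (standing in for one budgeted run of the modified algorithm, returning `None` on Failure) are hypothetical names, and the guesses are assumed to take the form $2^{i-1}\epsilon$.

```python
import math

def adapt_to_noise_bound(run_with_guess, eps):
    """Try guesses alpha_i = 2**(i-1) * eps in increasing order.

    run_with_guess(alpha_hat) is a hypothetical callable returning a
    classifier, or None to signal Failure (budget exhausted).
    """
    rounds = max(1, math.ceil(math.log2(1.0 / eps)))
    for i in range(1, rounds + 1):
        alpha_hat = 2.0 ** (i - 1) * eps
        h = run_with_guess(alpha_hat)
        if h is not None:
            return h, alpha_hat
    return None, None
```

Because the guesses double, the first successful guess exceeds the smallest workable value by at most a factor of 2, which is why the overhead is confined to the number of rounds.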

4.2 Bounds Based on the Splitting Index 

By the same reasoning as in the proof of Theorem 8, except running Algorithm 2 with Algorithm 3
instead, one can prove an analogous bound based on the splitting index of Dasgupta (2005), rather
than the disagreement coefficient. This is interesting, in that one can also prove a lower bound
on $\mathrm{QC}_{\mathrm{AL}}$ in terms of the splitting index, so that composed with Theorem 7, we have a nearly tight
characterization of $\mathrm{QC}_{\mathrm{CCQ}}(\epsilon, \delta, C, \mathrm{BN}(C, \alpha; \mathcal{D}_X))$. Specifically, consider the following definitions
due to Dasgupta (2005).



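Let $Q \subseteq \{\{h, g\} : h, g \in C\}$ be a finite set of unordered pairs of classifiers from $C$. For $x \in \mathcal{X}$
and $y \in \mathcal{Y}$, define $Q_x^y = \{\{h, g\} \in Q : h(x) = g(x) = y\}$. A point $x \in \mathcal{X}$ is said to $\rho$-split $Q$ if

$$\max_{y \in \mathcal{Y}} |Q_x^y| \le (1 - \rho)|Q|.$$

Fix any distribution $\mathcal{D}_X$ on $\mathcal{X}$. We say $\mathcal{H} \subseteq C$ is $(\rho, \Delta, \tau)$-splittable if for all finite $Q \subseteq \{\{h, g\} \subseteq
\mathcal{H} : \mathbb{P}_{\mathcal{D}_X}(x : h(x) \ne g(x)) > \Delta\}$,

$$\mathbb{P}_{\mathcal{D}_X}(x : x \text{ } \rho\text{-splits } Q) \ge \tau.$$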
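As an illustration of the splitting definition, the following Python sketch computes, for a finite set of pairs and a point $x$, the largest $\rho$ for which $x$ $\rho$-splits the set. The function name `split_parameter` and the representation of classifiers as callables are assumptions for illustration.

```python
from collections import Counter

def split_parameter(Q, x):
    """Largest rho such that x rho-splits Q: 1 - max_y |Q_x^y| / |Q|.

    Q is a non-empty list of pairs (h, g) of callables; Q_x^y keeps the
    pairs with h(x) == g(x) == y.
    """
    agree = Counter(h(x) for h, g in Q if h(x) == g(x))
    largest = max(agree.values(), default=0)
    return 1.0 - largest / len(Q)
```

For example, with threshold classifiers, a point lying between the two thresholds of a pair eliminates that pair whichever label is returned, while a point on which every pair agrees with the same label splits nothing.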
A large value of $\rho$ for a reasonably large $\tau$ indicates that there are highly informative examples that
are not too rare. Following Dasgupta (2005), for each $h \in C$, $\tau > 0$, $\epsilon > 0$, we define

$$\rho_{h,\tau}(\epsilon) = \sup\{\rho : \forall \Delta \ge \epsilon/2, B(h, 4\Delta) \text{ is } (\rho, \Delta, \tau)\text{-splittable}\}.$$

Here, $B(h, r) = \{g \in C : \mathbb{P}_{\mathcal{D}_X}(x : h(x) \ne g(x)) \le r\}$ for $r > 0$. Though Dasgupta (2005) explores
results on the query complexity as a function of $h^*$, $\mathcal{D}_X$, for our purposes (minimax analysis)
we will take a worst-case value of $\rho$. That is, define

$$\rho_\tau(\epsilon) = \inf_{h \in C} \rho_{h,\tau}(\epsilon).$$

Theorem 7 relates the query complexity of CCQ to that of AL. There is much known about
the latter, and in the interest of stating a concrete result here, we briefly describe a particularly tight
result, inspired by the analysis of Dasgupta (2005).

Lemma 9 There exist universal constants $c_1, c_2 \in (0, \infty)$ such that, for any concept space $C$ of
Natarajan dimension $d$, any $\alpha \in [0, 1/2)$, $\epsilon, \delta \in (0, 1/16)$, and distribution $\mathcal{D}_X$ over $\mathcal{X}$,

inf < QCAL(e, ^, C, BN(C, a; Vx)) < mi . ^'f' . . log^ ^ ^ 



r>op^{4e)-^ Ai.v , , , V , > -r>o(l-2a)Vr(e) \e6T{l - 2a) 

The proof of Lemma 9 is included in Appendix B. The implication of the lower bound given by
Theorem 7 combined with Lemma 9 is as follows.

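Corollary 10 There exists a universal constant $c \in (0, \infty)$ such that, for any concept space $C$ of
Natarajan dimension $d$, any $\alpha \in [0, 1/2)$, $\epsilon, \delta \in (0, 1/32)$, and distribution $\mathcal{D}_X$ over $\mathcal{X}$,

$$\mathrm{QC}_{\mathrm{CCQ}}(\epsilon, \delta, C, \mathrm{BN}(C, \alpha; \mathcal{D}_X)) \ge \frac{c\alpha}{k-1} \cdot \inf_{\tau > 0} \frac{1}{\rho_\tau(4\epsilon)} - 4\ln\left(\frac{1}{\delta}\right).$$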

In particular, this means that in some cases, the query complexity of CCQ learning is only
smaller by a factor proportional to $\alpha$ compared to the number of random labeled examples required
by passive learning, as indicated by the following example, which follows immediately from
Corollary 10 and Dasgupta's analysis of the splitting index for interval classifiers (Dasgupta, 2005).

Corollary 11 For $\mathcal{X} = [0, 1]$ and $C = \{2\mathbb{1}_{[a,b]} - 1 : a, b \in [0, 1]\}$ the class of interval classifiers,
there is a constant $c \in (0, 1)$ such that, for any $\alpha \in [0, 1/2)$ and sufficiently small $\epsilon > 0$,

$$\mathrm{QC}_{\mathrm{CCQ}}(\epsilon, 1/32, C, \mathrm{BN}(C, \alpha)) \ge c\,\frac{\alpha}{\epsilon}.$$

There is also a near-matching upper bound compared to Corollary 10. That is, running
Algorithm 2 with Algorithm 3 of Appendix B, we have the following result in terms of the splitting
index.

Theorem 12 For any concept space $C$ of Natarajan dimension $d$, and any $\alpha \in [0, 1/2)$, for any
distribution $\mathcal{D}_X$ over $\mathcal{X}$,



QCccQ(e,'5,C,BN(C,a;Px)) 



O I fcdlog^ ( ^ , " ^ , ) + inf — -- log' 



d \ . „ akd^ 



e5T{l-2a)J r>o (1 - 2Q;)Vr(e) \e6T{l-2a) 
Logarithmic factors and terms unrelated to $\epsilon$ and $\alpha$ aside, in spirit the combination of
Corollary 10 with Theorem 12 implies that in the bounded noise model, the specific reduction in query
complexity of using class-conditional queries instead of label request queries is essentially a factor
of $\alpha$.

5. Other types of queries 

Though the results of this paper are formulated for class conditional queries, similar arguments can 
be used to study the query complexity of other types of queries as well. For instance, as is evident 
from the fact that our methods interact with the oracle only via the Find-Mistake subroutine, all of 
the results in this work also apply (up to a factor of k) to a kind of sample-based equivalence query, 
in which we provide a sample of unlabeled examples to the oracle along with a classifier $h$, and the
oracle returns an instance in the sample on which h makes a mistake, if one exists. 

6. Conclusions 

In this paper we propose and study an extension of the standard active learning model where more 
general class-conditional queries are allowed, focusing on the problem of learning in the presence 
of noisy data. We give nearly tight upper and lower bounds on the number of queries needed to 
learn both for the general agnostic setting and for the bounded noise model. Our analysis provides a 
clear picture into the power of these queries in realistic statistical learning settings, which may help 
to inform their use in practical learning problems, as well as provide a point of reference for future 
exploration of the general topic of interactive machine learning. 
Appendix A. Useful Facts

Lemma 13 Let $B_1, \ldots, B_k$ be independent Geometric$(\alpha)$ random variables. With probability at
least $1 - \delta$,

$$\sum_{i=1}^{k} B_i \le \frac{2}{\alpha}\left(k + 4\ln\left(\frac{1}{\delta}\right)\right).$$

Proof Let $m = \frac{2}{\alpha}\left(k + 4\ln\left(\frac{1}{\delta}\right)\right)$. Let $X_1, X_2, \ldots$ be i.i.d. Bernoulli$(\alpha)$ random variables. $\sum_{i=1}^{k} B_i$
is distributionally equivalent to a value $N$ defined as the smallest value of $n$ for which $\sum_{i=1}^{n} X_i = k$,
so it suffices to show $\mathbb{P}(N \le m) \ge 1 - \delta$.

Let $H = \sum_{i=1}^{m} X_i$. We have $\mathbb{E}[H] = \alpha m \ge 2k$. By a Chernoff bound, we have

$$\mathbb{P}(H < k) \le \mathbb{P}(H < (1/2)\mathbb{E}[H]) \le \exp\{-\mathbb{E}[H]/8\} \le \exp\left\{-\ln\frac{1}{\delta}\right\} = \delta.$$

Therefore, with probability $1 - \delta$, we have $N \le m$, as claimed. ■
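A quick Monte Carlo check of this tail bound can be sketched in Python. The helper names `geometric`, `lemma13_bound`, and `empirical_failure_rate` are hypothetical, and the simulation only illustrates the inequality for particular parameter values; it is not a substitute for the proof.

```python
import math
import random

def geometric(p, rng):
    """Number of Bernoulli(p) trials up to and including the first success."""
    n = 1
    while rng.random() >= p:
        n += 1
    return n

def lemma13_bound(k, a, delta):
    """The threshold m = (2 / a) * (k + 4 * ln(1 / delta))."""
    return (2.0 / a) * (k + 4.0 * math.log(1.0 / delta))

def empirical_failure_rate(k, a, delta, trials, seed=0):
    """Fraction of trials in which a sum of k Geometric(a) draws exceeds m."""
    rng = random.Random(seed)
    m = lemma13_bound(k, a, delta)
    failures = sum(
        sum(geometric(a, rng) for _ in range(k)) > m
        for _ in range(trials)
    )
    return failures / trials
```

In such simulations the observed failure rate is typically far below $\delta$, reflecting the slack in the Chernoff bound used in the proof.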



Appendix B. Splitting Index Bounds 

We prove Lemma 9 in two parts. First, we establish the lower bound. The technique for this
is quite similar to a result of Dasgupta (2005). Recall that $\mathrm{QC}_{\mathrm{AL}}(\epsilon, \delta, C, \mathrm{Realizable}(C; \mathcal{D}_X)) \le
\mathrm{QC}_{\mathrm{AL}}(\epsilon, \delta, C, \mathrm{BN}(C, \alpha; \mathcal{D}_X))$. Thus, the following lemma implies the lower bound of Lemma 9.

Lemma 14 For any hypothesis class $C$ of Natarajan dimension $d$, for any distribution $\mathcal{D}_X$ over $\mathcal{X}$,
QCAL(e, l/16,C,3lealizable(C;Px)) > inf 



r>o p^(4e) 

Proof The proof is quite similar to that of a related result of Dasgupta (2005). Fix any $\tau \in
(0, 1/4)$, and suppose $A$ is an active learning algorithm that considers at most the first $1/(4\tau)$
unlabeled examples, with probability greater than $7/8$. Let $h \in C$ be such that $\rho_{h,\tau}(4\epsilon) \le 2\rho_\tau(4\epsilon)$,
and let $\Delta \ge 2\epsilon$ and $Q \subseteq \{\{f, g\} \subseteq B(h, 4\Delta) : \mathbb{P}_{\mathcal{D}_X}(x : f(x) \ne g(x)) > \Delta\}$ be such that
$\mathbb{P}_{\mathcal{D}_X}(x : x \text{ } 2\rho_{h,\tau}(4\epsilon)\text{-splits } Q) < \tau$. In particular, with probability at least $(1 - \tau)^{1/(4\tau)} \ge 3/4$,
none of the first $1/(4\tau)$ unlabeled examples $2\rho_{h,\tau}(4\epsilon)$-splits $Q$. Fix any such data set, and denote
$\rho = 2\rho_{h,\tau}(4\epsilon)$.

We proceed by the probabilistic method. We randomly select the target $h^*$ as follows. First,
choose a pair $\{f^*, g^*\} \in Q$ uniformly at random. Then choose $h^*$ from among $\{f^*, g^*\}$ uniformly
at random.

For each unlabeled example $x$ among the first $1/(4\tau)$, call the label $y$ with $|Q_x^y| > (1 - \rho)|Q|$
the "bad" response. Given the initial $1/(4\tau)$ unlabeled examples, the algorithm $A$ has some fixed (a
priori known, though possibly randomized) behavior when the responses to all of its label requests
are the bad responses. That is, it makes some number $t$ of queries, and then returns some classifier
$\hat{h}$.

For any one of those label requests, the probability that both $f^*$ and $g^*$ agree with the bad
response is greater than $1 - \rho$. Thus, by a union bound, the probability both $f^*$ and $g^*$ agree
with the bad responses for the $t$ queries of the algorithm is greater than $1 - t\rho$. On this event, the
algorithm returns $\hat{h}$, which is independent from the random choice of $h^*$ from among $f^*$ and $g^*$.
Since $\mathbb{P}_{\mathcal{D}_X}(x : f^*(x) \ne g^*(x)) > \Delta \ge 2\epsilon$, $\hat{h}$ can be $\epsilon$-close to at most one of them, so that there is
at least a $1/2$ probability that $\mathrm{err}(\hat{h}) > \epsilon$.

Adding up the failure probabilities, by a union bound the probability the algorithm's returned
classifier $h'$ has $\mathrm{err}(h') > \epsilon$ is greater than $7/8 - 1/4 - t\rho - 1/2 = 1/8 - t\rho$. For any $t < 1/(16\rho)$, this
is greater than $1/16$. Thus, there exists some deterministic $h^* \in C$ for which $A$ requires at least
$1/(16\rho)$ queries, with probability greater than $1/16$.

As any active learning algorithm has a $7/8$-confidence upper bound $M$ on the number of
unlabeled examples it uses, letting $\tau \to 0$ in the above analysis allows $M \to \infty$, and thus covers all
possible active learning algorithms. ■

We will establish the upper bound portion of Lemma 9 via the following algorithm. Here we
write the algorithm in a closed form, but it is clear that we could rewrite the method in the batch-based
style required by Algorithm 2 above, simply by including its state every time it makes a batch
of label request queries. The value $\epsilon_0$ in this method should be set appropriately for the result below;



specifi cally, we will coarsely take eq = 0((1 — 2a)^er^(5/(i^), based on the analysis of iDasgupta 



(2005) for the realizable case. 



Algorithm 3 An active learning algorithm for learning with bounded noise, based on splitting.
Input: The sequence $U = (x_1, x_2, \ldots)$; allowed error rate $\epsilon$; value $\tau \in (0, 1)$; noise bound $\alpha \in [0, 1/2)$.

I. Let $V$ denote a minimal $\epsilon_0$-cover of $C$

II. For each pair of classifiers $h, g \in V$, initialize $M_{hg} = 0$

III. For $r = 1, 2, \ldots, \lceil \log_2(2/\epsilon) \rceil$

1. Consider the set $Q \subseteq V^2$ of pairs $\{h, g\} \subseteq V$ with $\mathbb{P}_{\mathcal{D}_X}(x : h(x) \ne g(x)) > 2^{-r}$

2. While ($|Q| > 0$)

(a) Let $S = \emptyset$

(b) Do O ( (dlog (i) + log (i))) times 

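i. Let $\hat{Q} = Q$

ii. While ($|\hat{Q}| > 0$)

A. From among the next $1/\tau$ unlabeled examples, select the one $\hat{x}$ with minimum
$\max_{y \in \mathcal{Y}} |\hat{Q}_{\hat{x}}^{y}|$, and let $\hat{y}$ denote the maximizing label

B. $S \leftarrow S \cup \{\hat{x}\}$

C. $\hat{Q} \leftarrow \hat{Q}_{\hat{x}}^{\hat{y}}$

(c) Request the labels for all examples in $S$, and let $L$ be the resulting labeled examples

(d) For each $h, g \in V$, let $M_{hg} \leftarrow M_{hg} + |\{(x, y) \in L : h(x) \ne y = g(x)\}|$

(e) Let $V \leftarrow \left\{h \in V : \forall g \in V, M_{hg} - M_{gh} \le O\left(\sqrt{\max\{M_{hg}, M_{gh}\}\, d\log(\cdot)} + d\log(\cdot)\right)\right\}$

(f) Let $Q \leftarrow \{\{h, g\} \in Q : h, g \in V\}$
Output Any hypothesis $h \in V$.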
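The bookkeeping in step (d) can be sketched in Python: for each ordered pair of classifiers, count the labeled examples on which the first is wrong and the second is right. The function name `update_mistake_counts`, the dictionary representation of $M$, and the naming of classifiers by string keys are assumptions for illustration.

```python
from itertools import permutations

def update_mistake_counts(M, V, L):
    """Step (d): M[(h, g)] += #{(x, y) in L : h(x) != y == g(x)}.

    V maps classifier names to callables; M is a dict keyed by ordered
    name pairs (missing keys count as 0).
    """
    for h, g in permutations(V, 2):
        wins = sum(1 for x, y in L if V[h](x) != y and V[g](x) == y)
        M[(h, g)] = M.get((h, g), 0) + wins
    return M
```

The difference `M[(h, g)] - M[(g, h)]` is then the quantity compared against the threshold in step (e) when deciding whether to keep `h` in the version space.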
We have the following result for this method, with an appropriate setting of the constants in the
"$O(\cdot)$" terms.

Lemma 15 There exists a constant $c \in (0, \infty)$ such that, for any hypothesis class $C$ of Natarajan
dimension $d$, for any $\alpha \in [0, 1/2)$ and $\tau > 0$, for any distribution $\mathcal{D}_X$ over $\mathcal{X}$, for any $\mathcal{D}_{XY} \in
\mathrm{BN}(C, \alpha; \mathcal{D}_X)$, Algorithm 3 produces a classifier $h$ with $\mathrm{err}(h) \le \eta + \epsilon$ using a number of label
request queries at most

Proof [Sketch] Since $V$ is initially an $\epsilon_0$-cover, the $h \in V$ of minimal $\mathrm{err}(h)$ has $\mathrm{err}(h) \le \epsilon_0$.
Furthermore, $\epsilon_0$ was chosen so that, as long as the total number of unlabeled examples processed
does not exceed 0( (x_2a)^(;T^ )' ^^^^ probability 1 — 0{6), we will have h agreeing with h* on 
all of the unlabeled examples, and in particular on all of the examples whose labels the algorithm
requests. This means that, for every example $x$ we request the label of, $\mathbb{P}(h(x) = y \mid x) \ge 1 - \alpha$. By
Chernoff and union bounds, with probability $1 - O(\delta)$, for every $g \in V$, we always have

$$M_{hg} - M_{gh} \le O\left(\sqrt{\max\{M_{hg}, M_{gh}\}\, d\log(\cdot)} + d\log(\cdot)\right),$$

so that we never remove $h$ from $V$. Thus, for each round $r$, the set $V \subseteq B(h^*, 4\Delta_r)$, where
$\Delta_r = 2^{-r}$. In particular, this means the returned $h$ is in $B(h^*, \epsilon)$, so that $\mathrm{err}(h) \le \eta + \epsilon$.

Also by Chernoff and union bounds, with probability 1 — 0{6), any g £ V with M^^ + M^^ > 

o((T^logi^) has 



^9h -^hg>0 max{A4g, Mgh}d\og (^^^ + dlog (^^ 

so that we remove it from V at the end of the round. 

That $V \subseteq B(h^*, 4\Delta_r)$ also means $V$ is $(\rho, \Delta_r, \tau)$-splittable, for $\rho = \rho_{h^*,\tau}(\epsilon)$. In particular,
this means we get a $\rho$-splitting example for $Q$ every $1/\tau$ examples (in expectation). Thus, we always

satisfy the |Q| = condition after at most O log^ rounds of the inner loop (by Chernoff and 
union bounds, and the definition of p). Furthermore, among the examples added to S during this 
period, regardless of their true labels we are guaranteed that at least 1/2 of pairs {/i, g] in Q have at 
least one of (M^^ + M^^) or (M^/j + ^hg) incremented as a result: that is, for at least \Q\/2 pairs, 

at least one of the two classifiers disagrees with h on at least one of these examples. Thus, after 
executing this O ^ dlog {j^^ times, we are guaranteed that at least half of the {/ii, /12} 

pairs in Q have (for some i € {1, 2}) + ^ > O (^ (iJ^^)'! log , thus reducing \Q\ by at 
least a factor of 2. Repeating this log \Q\ = 0{dlog{l/eo)) times satisfies the \Q\ = condition. 
Thus, the total number of queries is at most 

r^f 1 , 5 1 

V(l - 2ay p eo 